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Abstract 

The Expectation-Maximization (EM) algorithm is an iterative approach to maximum likelihood pa- 
rameter estimation. Jordan and Jacobs (1993) recently proposed an EM algorithm for the mixture of 
experts architecture of Jacobs, Jordan, Nowlan and Hinton (1991) and the hierarchical mixture of ex- 
perts architecture of Jordan and Jacobs (1992). They showed empirically that the EM algorithm for 
these architectures yields significantly faster convergence than gradient ascent. In the current paper 
we provide a theoretical analysis of this algorithm. We show that the algorithm can be regarded as a 
variable metric algorithm with its searching direction having a positive projection on the gradient of the 
log likelihood. We also analyze the convergence of the algorithm and provide an explicit expression for 
the convergence rate. In addition, we describe an acceleration technique that yields a significant speedup 
in simulation experiments. 
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Introduction 

Although neural networks are capable in principle of representing complex nonlinear functions, 
the time required to train a complex network does not always scale well with problem size 
and the solution obtained does not always reveal the structure in the problem. Moreover, it is 
often difficult to express prior knowledge in the language of fully-connected neural networks. 
Achieving better scaling behavior, better interpretability of solutions and better ways of incor- 
porating prior knowledge may require a more modular approach in which the learning problem 
is decomposed into sub-problems. Such an approach has been used with success in the statistics 
literature and the machine learning literature, where decision-tree algorithms such as CART 
and ID3 and multivariate spline algorithms such as MARS have running times that can be 
orders of magnitude faster than neural network algorithms and often yield simple, interpretable 
solutions (Breiman, Friedman, Olshen & Stone, f984; Friedman, 1991; Quinlan, 1986). 

A general strategy for designing modular learning systems is to treat the problem as one 
of combining multiple models, each of which is defined over a local region of the input space. 
Jacobs, Jordan, Nowlan and Hinton (1991) introduced such a strategy with their "mixture of 
experts" (ME) architecture for supervised learning. The architecture involves a set of function 
approximators ("expert networks") that are combined by a classifier ("gating network"). These 
networks are trained simultaneously so as to split the input space into regions where particular 
experts can specialize. Jordan and Jacobs (1992) extended this approach to a recursively- defined 
architecture in which a tree of gating networks combine the expert networks into successively 
larger groupings that are defined over nested regions of the input space. This "hierarchical 
mixture of experts" (HME) architecture is closely related to the decision tree and multivariate 
spline algorithms. 

The problem of training a mixture of experts architecture can be treated as a maximum 
likelihood estimation problem. Both Jacobs et al. (1991) and Jordan and Jacobs (1992) derived 
learning algorithms by computing the gradient of the log likelihood for their respective archi- 
tectures. Empirical tests revealed that although the gradient approach succeeded in finding 
reasonable parameter values in particular problems, the convergence rate was not significantly 
better than that obtained by using gradient methods in multi-layered neural network archi- 
tectures. The gradient approach did not appear to take advantage of the modularity of the 
architecture. An alternative to the gradient approach was proposed by Jordan and Jacobs (in 
press), who introduced an Expectation-Maximization (EM) algorithm for mixture of experts 
architectures. EM is a general technique for maximum likelihood estimation that can often 
yield simple and elegant algorithms (Baum, Petrie, Soules & Weiss, 1970; Dempster, Laird & 
Rubin, 1977). For mixture of experts architectures, the EM algorithm decouples the estimation 
process in a manner that fits well with the modular structure of the architecture. Moreover, 
Jordan and Jacobs (in press) observed a significant speedup over gradient techniques. 

In this paper, we provide further insight into the EM approach to mixtures of experts 
architectures via a set of convergence theorems. We study a particular variant of the EM 
algorithm proposed by Jordan and Jacobs (in press) and demonstrate a relationship between 
this algorithm and gradient ascent. We also provide theorems on the convergence rate of the 
algorithm and provide explicit formulas for the constants. 

The remainder of the paper is organized as follows. Section 2 introduces the ME model. 
The EM algorithm for this architecture is derived and two convergence theorems are presented. 
Section 3 presents an analogous derivation and a set of convergence results for the HME model. 
Section 4 introduces two acceleration techniques for improving convergence and presents the 
results of numerical experiments. Section 5 presents our conclusions. 
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Figure 1: The mixture of experts architecture. The total output fi is the weighted sum of the 
expert network outputs: fi = gifii +#2/^2? where the weights are the gating network outputs 
g x and g 2 . 

Theoretical analysis of an EM algorithm for the mixture of 
experts architecture 

Network learning based on maximum likelihood estimation 

We begin by studying the non-hierarchical case. As shown in Figure 1, the mixture of experts 
(ME) architecture is comprised of K expert networks, each of which solves a function approx- 
imation problem over a local region of the input space. To each expert network we associate 
a probabilistic model that relates input vectors x £ R n to output vectors y £ R m . We denote 
these probabilistic models as follows 

P(y\x,0 j ),j = l,2,---,K, 

where the Oj are parameter vectors. Each of these probability densities is assumed to belong 
to the exponential family of densities (Jordan & Jacobs, in press). The j expert network 
produces as output a parameter vector /a 

fij = fj(x,0j), j = 1,2,- --,K, 

which is the location parameter for the j probability density. In the current paper, as in 
Jordan and Jacobs (in press), we treat the case in which the functions fj are linear in the 
parameters. We extend our results to the case of experts that are nonlinear in the parameters 
in the Appendix. 

We also assume, for simplicity, that the probability densities P(y|x,0j) are gaussian, im- 
plying that the location parameter is simply the mean. We associate a covariance matrix £j 



with each expert network, yielding the following probabilistic model for expert j 

P(y|x,^) = (2vrdetS J )^exp{-i[y-f J (x,^)] T S- 1 [y-f J (x,^)]}- (1) 

The ME architecture also utilizes a auxiliary network known as a gating network, whose job 
it is to partition the input space into regions corresponding to the various expert networks. This 
is done by assigning a probability vector [g\,g2, • • ',9k] T to each point in the input space. In 
particular, the gating net implements a parameterized function s : R n —> R and a normalizing 
function g : R K —> R K such that 

9j = </j(x, g ) = —k — -> 2 = 1, ' ' ' , K, (2) 

E,=i e * 

which satisfies 

K 

H#.?'( X A) = 1 ' foranyx,0 3 . (3) 

In the current paper we focus on the case in which the function s is linear (cf. Jordan & Jacobs, 
in press). In this case the boundaries gi = g^ are planar and the function g can be viewed as a 
smoothed piecewise-planar partitioning of the input space. 

Training data are assumed to be generated according to the following probability model. 
We assume that for a given x, a label j is selected with probability P(j|x) = gj(x,0 g ). An 
output y is then chosen with probability P(y|x,0j). Thus the total probability of observing y 
from x is given by the following finite mixture density 

K K 

P(y|x) = ^P(j|x)P(y|x,^) = ^^(x,^)P(y|x,^)- (4) 

3=1 3=1 

A training set y = {(x' f ), y' f '), t = 1, • • • , N} is assumed to be generated as an independent set 
of draws from this mixture density. Thus the total probability of the training set, for a specified 
set of input vectors {x' f '}^_ 1 , is given by the following likelihood function 

N 

i = p({y (f) } i v l{xW}f) = rP(y ( V f) ) (5) 

t=l 

N K 

= nE^( x(f) '^) p (y (f) l x(f) ,^)- (6) 

t=i J= i 

The learning algorithms that we discuss are all maximum likelihood estimators. That is, 
we treat learning as the problem of finding parameters g , Oj, and Sj to maximize L, or, more 
conveniently, to maximize the log likelihood / = In L 

N K 

l(G,y) = X>X>,-(xW 0,)P(yW|x W ,*;), 

*=1 3=1 

where = [0 g , 1 , 1 , ■ ■ ■ , K , Si, S 2 , • • •, Sk] T . 

Given the probability model in Eq. 4, the expected value of the output is given as follows 

K 
fi = P[y|x] = Y^9j(x,0g)Pj- 

3 = 1 



This motivates using the weighted output of the expert networks as the total output of the ME 

architecture (cf. Figure 1). 

The model in Eqs. (4) and (1) is a finite gaussian mixture model. It is interesting to 

compare this model to a related gaussian mixture model that is widely studied in statistics; 

i.e., the model 

K K 

P{*) = Y, a i P ?\*\ e i)> «j>0, 5>i = 1 - (7) 

j=i j=i 

The difference between these models is clear: the afs in Eq. (7) are independent of the input 
vectors, while the gfs in Eq. (4) are conditional on x (they represent the probabilities P(j|x)). 
Thus model (7) represents a unconditional probability, appropriate for unsupervised learning, 
while model (4) represents a conditional probability, appropriate for supervised learning. 

There is another model studied in statistics, the switching regression model (Quandt & 
Ramsey, 1972, 1978, De Veaux, 1986), that is intermediate between model (7) and model (4). 
The switching regression model is given as follows 

P(y|x) = AP(y|x, X ) + (1 - A)P(y|x, 2 ), (8) 

where the P(y|x,#i) are univariate gaussians and the mean of each gaussian is assumed to be 

linear in x. This model assumes that the data pair {y, x} is generated from a pair of linear 

regression models through a random switch which turns to one side with probability A and to 

the other side with probability 1 — A. This model can be generalized to allow for a multinomial 

switch 

K 

P(y|x) = ^a J P(y|x,^), (9) 

where aj > 0, J2?=i a j = 1 an( i -P(y| x ?^j) i s given by Eq. (1). The difference between 
switching regression and the ME model is that the switching regression model assumes that 
the setting of the switch is independent of the input vector. This assumption does not allow 
for piecewise variation in the form of the regression surface; all of the regression components 
contribute throughout the input space. Switching regression can be viewed as one end of a 
continuum in which the overlap in the regression components is total; decision tree models 
(e.g., Breiman et al., 1984) are the other end of the continuum in which the overlap is zero. 
The ME model interpolates smoothly between these extremes. 

An EM algorithm for training the mixture of experts 

In many estimation problems the likelihood is a complicated nonlinear function of the param- 
eters. Parameter estimation in such cases usually involves some sort of numerical optimization 
technique, typically gradient ascent. An alternative to gradient techniques, applicable in many 
situations, is the "Expectation-Maximization" or "EM" algorithm (Baum, Petrie, Soules & 
Weiss, 1970; Dempster, Laird & Rubin, 1977). EM is based on the idea of solving a succession 
of simplified problems that are obtained by augmenting the original observed variables with a 
set of additional "hidden" variables. Unconditional mixture models are particularly amenable 
to the EM approach (Redner & Walker, 1984) and, as observed by Jordan and Jacobs (in press), 
the conditional mixture of experts model is also amenable to an EM approach. 

Given an observed data set y, we augment y with a set of additional variables y m i s , called 
"missing" or "hidden" variables, and consider a maximum likelihood problem for a "complete- 
data" set Z = {y,y m i s } (cf. Little & Rubin, 1987). We choose the missing variables in such 



a way that the resulting "complete-data log likelihood," given by l c (0,Z) = In P(y,y m i s \0), 
is easy to maximize with respect to 0. The probability model P(y,y m i s \0) must be chosen 
so that its marginal distribution across y, referred to in this context as the "incomplete-data" 
likelihood, is the original likelihood 



p(y\&) = J p(y,y mi s\&)dy„ 



10) 



In deriving an update to the parameters based on the complete-data log likelihood, we first note 
that we cannot work directly with the complete-data log likelihood, because this likelihood is 
a random function of the missing random variables y m i s . The idea is to average out y m i s , that 
is, to maximize the expected complete-data log likelihood Ey mt Jin P(y, Jmisl®)]- This idea 
motivates the EM algorithm. 

The EM algorithm is an iterative algorithm consisting of two steps: 

• The Expectation (E) step, which computes the following conditional expectation of the 
log likelihood 



Q(0\0^) = Ey_{lnP(Z\0)\y,0^} 

= j p(y miS \y,® {k) )inP(z\0)dy mts (ii) 



where 0^ ' is the value of the parameter vector at iteration k. 
• The Maximization (M) step, which computes 

{k+1 ^ = argmax Q(0\0^). (12) 

The M step chooses a parameter value that increases the Q function; the expected value of 
the complete-data log likelihood. Dempster, Laird and Rubin (1977) proved that an iteration 
of EM also increases the original log likelihood /. That is, 

i(e^ k+1 )-y)>i{e^-y). 

Thus the likelihood / increases monotonically along the sequence of parameter estimates gen- 
erated by an EM algorithm. 

Although in many cases the solution to the M step can be obtained analytically, in other 
cases an iterative inner loop is required to optimize Q. Another possibility is to simply increase 
the value of Q during the M step 

Q(0( k + 1 )\0( k )) > Q(0( k )\0( k )) (13) 

by some means, for example by gradient ascent or by Newton's method. An algorithm with 
an M-step given by Eq. (13) is referred to as a generalized EM (GEM) algorithm (Dempster, 
Laird k Rubin, 1977). 

For the ME architecture we choose the missing data to be a set of indicator random variables 
Vmis = {P] t] J = l,---,K,t = l,---,N} with 

(t) J 1, if y' f ' is generated from the j'-th model given by Eq. (1), , 

3 | 0, otherwise 
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and 

K 

2_\i) = ~^-> f° r eacn t- 

We assume that the distribution of the complete data is given as follows 

N K 

p(z\0) = n nb,(x (f) ,^)P(y (f) |xW,^)]' . 

t=l 3 =1 

It is easy to verify that this distribution satisfies Eq. (10). 
From Eq. (11), we also obtain 

Q(0\0^) = Ey mts {lnP(Z\0)\y,0^} 

JV K 



^^/,f(01n[^(xW,^)P(yW|xW,^)] 
t=i j=i 

JV A' JV 



t=l j=l 

JV A' JV 

jWwi, „yJ')^x Y^/,( fc )r 
t=i j=i t=i 

JV 

••• + ^4J ) (01nP(yW|xW,^), (15) 

t=i 



where 



*(*)(*) = P[/f|J,0( fc )] = P( J |xW,yW) 
gj (xW,gW)P(yW|xW,gf) 

E(Ii^(xW,^))P(yW|xW,0fV 



(16) 



where P(j|x' f ',yW) denotes the probability that the pair {x' f ',yW} comes from the j th prob- 
ability model. Note that we always have h- '(t) > 0. 

With the Q function in hand, we now investigate the implementation of the M step. From 
Eq. (15), Eq. (1) and Eq. (2), we have 

u °g t= i J=1 u °g 

JV K n K n 

t=i J= i uo 3 i=1 u °g 

JV K ft 

= EE^fW-^(x (t) ,^)]|^, (17) 

7=t]=\ de " 



'g 



dQ ^ u(k) 8P(yM\xW,0 

d6 3 



% - Y.^ r^r^ ,^^^ 



"j t= i ^"1 

JV 



Y>t\t) ' ^' ^ STl[yW - f .( X W, *.)], j = 1, . ..,/,, (18) 



and 



H = |>j»> w a W-''W v->,«, 



N 



1 

3 =1,"-,K. (19) 



By letting g^-\ y _ y (*+i) = 0, we obtain the update for the covariance matrices 



- 1 } n 3- 



s 



(fc+1) 



JV 



Er=iM fc) (*)s 



^JV 



E^OIyW-fiCxW.^OlIyW-fiCxW,^ 



(20) 



Assuming that the training set y is generated by a mixture model, we note that when the the 
sample number N is sufficiently large (relative to the dimension of y), the space spanned by 
the N vectors [yW — fj(x' f ',0j)],i = 1, • • • , iV will be of full dimension with probability one. 

Recalling that h- '(t) > we observe that when the sample number N is sufficiently large the 

matrices £,• ' are therefore positive definite with probability one. 



Next, by letting -§§-\ e ._ 



.0(*+i) 



0, we obtain 



N 






(*), 



gf/(xw,^; 



s/VbrW-f^xW,^ 



0, 



f211 



which we can solve explicitly given our assumption that the expert networks are linear 



where 



i(*+i) 



N 



(RfYy^f) 



c f = E^ ) W^(sf)- 1 y (f) , 



1 A^l 1 

t=l 



N 



4 , = E^i (^(sfr 1 ^, 

t=l 



and 



xl 



?W 



(xW) 3 



(xW) T 







(x«) 3 



1 
1 

••• 



... o 
... o 

1 



(22) 



(23) 



(24) 



(25) 



Note that R y - ' is invertible with probability one when the sample size N is sufficiently large. 

Finally, let us consider the update for g . Jordan and Jacobs (in press) observed that the 
gating network is a specific form of a generalized linear model, in particular a multinomial logit 
model (cf. McCullagh & Nelder, 1984). Multinomial logit models can be fit efficiently with 
a variant of Newton's method known as iteratively reweighted least squares (IRLS). For the 
purposes of the current paper, we simply write the generic form of a Newton update and refer 
the reader to Jordan and Jacobs (in press) for further details on IRLS. Note also that Jordan 
and Jacobs (in press) assume that the inner loop of IRLS fitting runs to completion. In the 
current paper we address only the case in which a single IRLS step is taken in the inner loop. 
The form of this IRLS step is generalized to allow a learning rate parameter. 1 



Thus the algorithm that we analyze in this paper is, strictly speaking, a GEM algorithm. 



The update for the gating network parameters is obtained as follows. Denote the gradient 
vector at iteration k as 



-3 

and the Hessian matrix at iteration k as 

N K 



N K fl 

OS; 



ef»=EOT-^.«^ 



m) 



4 fc) = EE^(x (t) ,^)[l-^(xW,^)]|^^. (27) 

t=l 3 = 1 



l d9 g d0r 



Then the generalized IRLS update is given as follows 

^+i) = flW +7a (iEW)-i e ( fc ), (28) 

where j g is a learning rate. 

In summary, the parameter update for the model Eq. (4) is given as follows 

Algorithm 1 

1. (The E step): Compute the hf\ty S by Eq. (16). 

2. (The M step): Compute sj fc+1) by Eq. (20), compute 9 g k+1 ^ by Eq. (28), and also 
compute 0f +1 \j = 1, • • • , K by Eq. (22). 



Before closing this section, let us return to the switching regression model (Eq. 9). Following 
the same procedure as above, we obtain the following EM algorithm for switching regression. 

Algorithm 2 

1. (The E step): Compute the h- (i)'s by 

(k) ^(xW^yWlxWfl^) 

h) \t) = 3 m 3 m • (29) 

E (l ia f)(xW)P(yW|xW,^) 

2. (The M step): Compute sj fc+1) by Eq. (20), and let 

af +1) (xW) = hf\t). (30) 

Obtain 6- , j = 1, • • • , K in the same manner as for model Eq. (4). 



We see that the EM algorithm for switching regression is simpler, because the a- '(x' f ')'s 
are not constrained through a common parameter g as in the ME model. 



Theoretical convergence results 

In this section we provide a number of convergence results for the algorithm presented in the 
previous section. We study both the convergence and the convergence rate of the algorithm. 
In the Appendix we extend these results to a number of related algorithms. 

We begin with a convergence theorem that establishes a relationship between the EM algo- 
rithm and gradient ascent. 

Theorem 1 For the model given by Eq. (4) and the learning algorithm given by Algorithm 1, 
we have: 

e (k+i)_ e (k) _ p(fc)_^_i 

-pr>]--pf] = ^iw"^ 1 -^ (31) 

where I = In L is given by Eq. (6), Eq. (4) and Eq. (1), and "vec[A] v denotes the vector- 
obtained by stacking the column vectors of the matrix A. 

Moreover, assuming that the training set y is generated by the mixture model of Eqs. (4) 
and (1) and assuming that the number N is sufficiently large, we have that Pg ' is a positive 
definite matrix, and P- , P^ , j = 1, • • • , K are positive definite matrices with probability one. 
Specifically, they take the following values: 

(i) P ( g k) = lg (R g k ))-i with R { g k) given by Eq. (28). 

(ii) For j = 1, • • • , K, pf ] = (R^)- 1 with Rf ] given by Eq. (22). 
(Hi) For j = 1, • • • , K , 

pW = 2 S f } ® Sf > (32) 

where "(/)" denotes the Kronecker product. For a to X n matrix A and qx m matrix B, the 
Kronecker product A(x) B is defined as 



A®B 



Proof. From Eqs. (6), (4) and (1), for / = lni, we obtain the following derivatives: 

di , _ ^ gj (xW,f)p( y (')|xW,ef) d 9 ^),e g )ide g \ eg=e{gk) 



anB 


a 12 B ■ 


a ln B 


a 2 \B 


a 22 P • 


a2 n B 


a m \B 


a-miB 


ajnn-^ 



d0<8 9 =ei k) 



K p. 

EE[hf\t)- gj ^\^))]^-\ g= , kh (33) 



t 
N K 



t=l j=l 



'9 



dl , 



" ^(x« 0( fc ))P(y«|x« flf) 9P(yW|xW,^)/^| ej= ^) 



d6] %=e)> ^ L E (l l58 (xW,^))P(yW|xW,0fV P(yW|xW,^) 



J =h---,K, (34) 






5S .'E>=EW ^ L E ^ lft ( x W,^))p( y (t)| x (t),0f)) J P(yW|xW,0f) 

1 N 

J =1*~ 1 -,A', (35) 

We now prove points (i), (ii) and (iii). 

(i) Comparing Eq. (33) with Eq. (28) it follows that pf = T^-R?*) -1 - To show that Pf 
is positive definite, we show that R g ' is positive definite. 2 For an arbitrary vector u, from Eq. 
(28) we have 

N K a a 

t=l j=l aU 3 OVg 

N K 

= E E^( x(f) > *? } )[i - ^( x(f) > ^Vv > o, 
*=i j=i 

since gj(x.^\0^ ')[! — gj(~K^\0^ ')] > 0. Equality holds in the above equation only when 

v = [dsj/86 ]u = for any u, which is impossible. Thus we have established that R g ' (and 

thus also (Rg ) _1 ) is positive definite. 
4k) p (fc) 



(ii) Let C) ', R) ' be given by Eq. (22). From Eq. (21) and Eq. (34), we obtain 

[<A»-tfW] = &\.,.„j = i,..,K: 



Furthermore, it follows from Eq. (22) that 

e (k+i) = qW + ( R (kh-i c [k) _ e (k) 



3 ' v 3 ' 3 3 

= of + {Rf ) )- 1 [cf ) -{Rf ) )0f ) ]. (36) 

That is, we have 

e (k+i) _ e (k) {R {k) r id}_< j-i ... K 

and Pf ] = (RfY 1 . 

We now prove that R- ' is positive definite. For an arbitrary vector u, from Eq. (22 ) we 

have 

N N 

u T Rfu = J>f (0u T X t (sf )" 1 X t T u = 5>f (Ov^Ef )" X v > 0. 
t=i t=i 



"A matrix A is positive definite if and only if A is positive definite. 

10 



From the note immediately following Eq. (16)), we know that S^- given by Eq. (20) is positive 
definite and invertible for each k with probability one. Thus, with probability one, the equality 
of the above equation holds only when v = Xju = for any u, which is impossible. Thus we 
have established that R- ' (and thus also (R- ) _1 ) is positive definite with probability one. 
(iii) We consider Eq. (20) for updating S^- . This equation can be expanded as follows 

1 N 

Sf +1) = 5f + _ N \ £ h?\t)\yM - f,-(xW 0,)][y (f) - f,-(xW 9 3 f - sf 



DLi^rc* 



*=i 



2S 

Ef+ s^>^' ,37) 



where 



1 N 

v Sj = -Iz^WPPr 1 ^ - [y (f) - ^(xWflf )][y« - f^xW^f^Ksf ) 



It follows from Eq. (35) that 



V =— I 



That is, we have 



(fc+i) _ 2S j d/ | v (fc) 






Utilizing the identity vec[ABC] = (C T ® A)fec[P], we obtain 



(fc+l) _ / (fc) (fc) 9/ 



^ ll= ^^ (W 



Thus Pi, = ^ w 2 / t N — (£„• ® £„• ). Moreover, for an arbitrary matrix P, we have 

' EtliS (*) 



j j 



W ec[P] T (sf > ® Sf )) V ec[^] = tr(sf Vsf V T ) 
= tr((sf ) P) T (sf V)) = vec[Ef ] U] T vec[J:f ] U] > 

(k) 

where the equality holds only when Yj- 'U = 0, which is impossible with probability one since 

(k) 

U is arbitrary, and T, x - ' is, as indicated above, positive definite with probability one. Thus we 

Ik) 

have established that P^ / is positive definite with probability one. □ 

Theorem 1 can be used to establish a relationship between the step taken by the EM 
algorithm and the direction of steepest ascent. Recall that for a positive matrix P, we have 
-7j B—q > 0. This implies the following corollary. 



Corollary 1 Assume that the training set {yW,x' f ',i = 1,---,JV} comes from the mixture 
model of Eqs. (4) and (1) and that N is sufficiently large. With probability one, the search 
direction of the EM algorithm has a positive projection on the gradient ascent searching direction 
of 1 = hi. L. 
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That is, the EM algorithm can be viewed as a modified gradient ascent algorithm for 
maximizing / = In L. From Theorem 1, B changes with the iteration step k, thus, the EM 
algorithm can also be regarded as a variable metric gradient ascent algorithm. This algorithm 
searches in an uphill direction, so if the learning rate is appropriate, the searching process will 
converge to a local maximum or a saddle point of the likelihood / = In L. 

Similar results have been obtained for unsupervised mixture models by Xu and Jordan 
(1993) and for Hidden Markov Models by Baum and Sell (1968). See Xu and Jordan (1993) for 
further discussion of the relationships between these theorems. 

We now utilize a result from Xu and Jordan (1993) to establish the convergence of the 
parameters 0^ ' . We also provide convergence rates for both /(©' ') and 0^ ' . 

Theorem 2 Assume that the training set y is generated by the mixture model of Eqs. (4)(1) 
and that N is sufficiently large. Assume further that Si, • • •, S/ ( - are diagonal and let vs be a 
vector consisting of the diagonal elements of Sj. 
Let us denote 

= [0j,0f,---X-,V Sl ,...,V SA .] T 

P = diag[Pf ), Pi, • • • , P K , P Sl , • • • , P Sl J, and H(0) = .^® ) T - 

O0O0 

Furthermore, assume that on a given domain Dq 

(i) f J Jr , — J Jr , i = 1, • • • , K, i = 1, • • • , m exist and are continuous; 

■ 7 dO q d6 q dOfiO] 

(ii) the Hessian matrix H(0) is negative definite; 

(Hi) 0* is a local maximum of 1(0), and 0* £ Dq. 

Then with probability one, 

(1) Letting —M, —m ( here M > m > 0) be the minimum and maximum eigenvalues 
of the negative definite matrix (P2 ) T H(0)(P^ ) (or equivalently the minimum and maximum 
eigenvalues of PH(0), since we have PHe = Ae from (P^) T HP^e = Xe), we have 

1(0*) - /(©W) < r k [l(0*) - /(©„)], (38) 



p-5(©(*) - ©*)|| < | r |*/2,/— [/(©*) - /(©o)], (39) 

2 

f < 1- We also have < |r| < 1 when M < 2. 
(2) For any initial point 0q £ Dq, lim^oo ©^ > = 0* when M < 2. 



where r = 1 - (1 - 4f )^- < 1. We also have < \r\ < 1 when M < 2 



Proof. As indicated earlier, when the training set {yW,x' f ',i = 1,---,JV} is generated 
from the mixture model of Eqs. (4)(1) and N is sufficiently large, S^- ' remains positive definite 
during the learning process. Thus, under the condition (i), it follows from Eqs. (6), (4) and 
(1) that H(0) exists and remains continuous on Dq. Expanding the log likelihood in a Taylor 
expansion, we have 

1(0) - 1(0*) = (© - 0*f ^M| 0=0 , + 1(0 - 0*fH(0* + e(0 - 0*))(0 - 0*) 

with < £ < 1. Since a /a \q-Q* = 0, we have 

1(0) - 1(0*) = -(& - 0*) T H(0* + £(0 - 0*))(0 - 0*). (40) 
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From Theorem 1 we know that P is positive definite. Furthermore, from condition (ii), 
H(0) is negative definite on Dq. This implies that P% exists and (P2 ) T H(0)(P^ ) is negative 
definite on D@. Utilizing the Rayleigh quotient we obtain that for any u, 

- M||u|| 2 < u T (P^) T H(0)(P^)u < -m||u|| 2 . (41) 

Substituting Eq. (41) into Eq. (40), we obtain 

1(0) - 1(0*) = -(0 - 0*) T (P- 1 2) T (P 1 2) T H(0* + £(© - 0*))P^P-^(0 - 0*) (42) 

l(0)-l(0*)>- — \\p- 1 2(0-0*)\\ 2 (43) 



Moreover, we have 



Thus 



M \P-hs-e- )f > |( -e-^-^Wl 

* -ll^^llll^ie-e-ill 



p-ii0-0")ll<Upi mt0] 



"to" 30 
Together with Eq. (43), we obtain 

-\\r'^\\ 2 < 2 -£m) -«*•)]. (441 

On the other hand, we also have 

1(0) - /(©(*)) = (0 - 0(k) ) T ^p-\ _ W + ~{ & ~ {k) ) T H(0^ + £'(© - (fc) ))(0 - {k) ) 

with < £' < 1. By Theorem 1, we know that for the EM algorithm, 0( fc+1 ) - 0( fc ) = 
P } \fi)_fi)(k)- Utilizing this result in the above equation, we obtain 

/(©(^)) - /(©(*)) = l|P^H 2 0=0 (,) 

M M, D id/(0) ||2 



(*), 



^ V-T^de^e"- (45) 



Combining Eq. (45) and Eq. (44), we obtain 



/(©(*+*)) - /(©(*>) > -(1 - f )^[/(0^) - 1(0*) 



and furthermore 



1(0^)) -1(0*) > [i-(i-_)__][/(0W)-/(0*)] 

M 9?77 2 

- ^-(i-y)^]*^) -/(©*)]■ (46) 
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Let r = [1 — (1 — 4f)^H- Multiplying both sides of the above equation by negative one, 
we obtain Eq. (38). In addition, it is easy to verify that < \r\ < 1 when M < 2 (recall that 
M > to). Furthermore, it follows from Eq. (41) and Eq. (42) that we have 

Z(©W) - 1(0*) < - — ||P"2(0( fc ) - 0*)\\ 2 



which, by Eq. (46), becomes 



_|| P -|( W _ @*)||2 > r k [l(0 o ) - 1(0*)] 



||p-|( (fc) _ @*)|| < \ r \k/2J3.[l(@*) - /(©„)] 

which is just Eq. (39). In addition, when M < 2, \r\ < 1, we have lim^oo 0^ ' = 0* since P 
is positive definite. □ 

We see from this theorem that the EM algorithm converges linearly. Moreover, the speed of 
convergence depends on the difference between M and to: the smaller the difference, the faster 
the convergence. 

Theoretical analysis of an EM algorithm for the hierarchical 
mixture of experts architecture 

An EM algorithm for training the hierarchical architecture 

The ME architecture can be viewed as an architecture for splitting the input space into regions 
in which different local functions are fit. The hierarchical mixture of experts (HME) architecture 
generalizes this idea to a nested model in which regions in the input space are split recursively 
into subregions (Jordan & Jacobs, 1992). The resulting tree-structured architecture can be 
viewed as a multi-resolution function approximator in which smoothed piecewise functions are 
fit at a variety of levels of resolution. 

As shown in Figure 2, the HME architecture is a tree. In this tree, each terminal node is an 
expert network, and each nonterminal node is a root of a subtree which itself corresponds to 
an HME architecture. At every nonterminal node in the tree there is a gating network which is 
responsible for the topmost split of the HME architecture rooted at that node. All of the expert 
networks and the gating networks in the architecture have the same input vector x £ R n . In 
the remainder of this section, as in Jordan and Jacobs (in press), we consider the case in which 
the expert networks and the gating networks are generalized linear models. Furthermore, for 
simplicity, we consider only the case in which the probability model for the experts is gaussian. 

Let us denote a node at depth r by Vi i 1 ...i r . This node is the i r -th. daughter of the node 
v i ii---i r -i- The root node of the tree is f 8o . The number of branches emitted from Vi i 1 ...i r is 
denoted by Ki i 1 ...i r . For simplicity, we can omit %q and write f 8l ... 8? , and Ki 1 ...i r . In addition, 
the output of the subtree rooted at f 8l ... 8? , is denoted 

K 

Yil—ir ~ / y 9il—i r ir+l V X ' "i-i—ir Jy«i-"«V«V+i 

*V+i=i 

where gi 1 ...i r i r+1 is the gating coefficient generated by the gating network attached at f 8l . ..;,,. 

This coefficient satisfies 

Ki 



8J..-8: 

/ y i/«l-"«r«r+l V-"-' "«!• 

*V+i=i 



9il---i r ir+l\ X ->*'li---ir) ~ ' 
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Figure 2: A two-level hierarchical mixture of experts. To form a deeper tree, each expert is 
expanded recursively into a gating network and a set of sub-experts. 



for any x, where 0\ ... 8 - is the parameter vector of the gating network. 

Given a training set y = {(x' f ',yW),i = 1,---,JV}, we want to maximize the likelihood 
function (cf. Eq. 6), that is, 



N 



i = p( {y W}f| {x W}f) = n^(y (f) |x 



(th 



t=i 



Expanding the probability model, we have 



P(yW|xW)=^^(xW,0f o )P(yW|xW,, n ; 



i 1= i 



where 



( ^ K n 



P(yW|xW,^)=< 



if Vi t is a nonterminal node, 

(2vr det S,, )-| e ^^^ ) - f n C^ ^ ^"^ C^ >» 



(47) 



if fj-j is a terminal node 



15 



and recursively 



P(y (f) |x( f U r 



' E&=I ^-.v(x,fl? 1 ...,- r )i'(y w |xW,T;,- 1 ...,- r ,- r+1 ), 

if Vi 1 ...i r is a nonterminal node, 

(27rdet£ n ... 8V )-§ e {-^ [y(t) - f ^ 



if Vi 1 ...i r is a terminal node 



(48) 
where f 8l ... 8? ,(., .) is parameterized function implemented by the expert network at terminal node 
f^. ..;,,, and 6i 1 ...i r , S 8l ... 8? , are the parameters of this expert network; P(yW|x^, v^...^) is the 
probability that yW is generated from the probability model rooted at f 8l ... 8? , when x' f ) is the 
input. 

To derive an EM algorithm for the HME architecture, we attach a set of indicator random 
variables to each nonterminal node v; 



"iX-'-lr 



A*) 



1, if y' f ' is generated by the subtree rooted at f 8l ... 8? ,, 
0, otherwise. 



The missing data y m i s consists of all of the indicator variables attached to the nonterminal 
nodes throughout the tree. In addition, we denote by J 7 ™."- the set consisting of the indicator 



variables attached to the nonterminal nodes in the subtree rooted at v. 



%i ■■■t r - 



We define the distribution of the complete data Z = {y,y m i s } as follows 



N K i 



p(z\0) = n n [^(xw ^jp(2,-jxw t,*)]^ 



w 



(49) 



t=i ii =i 



where 



P(Z tl \^\v tl )= < 



(t) 



life \Siii2 (* W , °i )P{Z ili2 jxW, v lll2 )*i* , 
if Vi t is a nonterminal node, 

(2 7 r^S,)-| e ^^ y(t) - f ^ x(t) ^" TS n 1 [ y(t) - f ^ x(t) ^)]>, 
if Vi t is a terminal node 



(50) 



and recursively 



P(7- ■ lx( f ) v ■ s 



FT n "" r \n- ■ ■ fx 9 ■ )P(Z- ■ ■ Ix^ v ■ ■ )pi 



r(0 



if w,: 



is a nonterminal node, 



(51) 



if Vi 1 ...i r is a terminal node 



It is not difficult to verify that this distribution satisfies Eq. (10) as required. 

We now compute the Q function as required by the E step of the EM algorithm. From Eq. 
(11), we obtain 

N K 'o 

Q(0|0W) = J2 E h^(t){H gn (^\0l)] + lnF n } 
t=i i 1= i 

where 

<k) <t) m g t J^ t \e 9{k) )p( k )(y( t )\^ t ),v t :) 

hf\t) = Euy'yy, & {k) ] = —P± — ' l0 ,,, y ' — — — , (52) 
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\ nF . _J Y,t 2 lihi 1 i 2 ( t ){M. c Ji 1 i 2 ( yLiyt \^ 9 i 1 )\ +ln-fiii 2 } 5 if Vij is a nonterminal node, 
[ lnP(yW|xW, t^), if f 8;1 is a terminal node 

and recursively 



lnP 8l ... 8 v 



' £.v+i=i /l .-i-.v.v + i( i ){ lll [5«-i-«v«v+i( x > fl f 1 ...,v)] + ln - F ii-«v«v+i}> 

if f 8l ... 8 v is a nonterminal node, 

lnP(yW|xW,^ 1 .., v ), 



if Vi 1 ...i r is a terminal node 



gn-^V +1 (xW,< fc UpW(y (f) |x (f) ^n-vv +1 ) 

E&=i r ^.v».v.v +1 (xW,<*i r )i'(*)(yW|xW,T;,- 1 ...,- r ,- r+1 ) ! 



(54) 



where #f ..., is the estimate of Q\ ... 8 - at iteration A;. P{y^ t '\yS t \ Vi 1 ...i r ) is given by Eqs. (47) and 
(48) and P^^y^lx^,?;^. ..;,,) means that the probability is determined with all the parameters 
in the subtree rooted at f 8l ... 8? , being fixed at the estimates obtained in iteration k. 

Proceeding now to the M step of the EM algorithm, we obtain parameter updates by 
optimizing the Q function. If the node f 8l ... 8? , is a terminal node, by setting the partial derivative 
of Q with respect to S 8l ... 8? , equal to zero, we obtain an update for the covariance matrices 

Er=i4* ) (*)^i(*)---^, r («) 

(55) 
To obtain an update for the parameters of the expert networks we differentiate Q with respect 
to 9i 1 ...i r and find that we must solve the following equation 

JV o f T / (t) a \ 

E^O^lCO • ■■^ M n w ' ,1 "- ,J s.-U[y (i) - *,».v(*w, ».-,-,v)] = o. (56) 

In the case of linear expert networks, this equation is a weighted least squares equation, which 
can be solved as follows 

g(k + l) = (R (k) N-l (*) 

where 

JV 

<£U = E^o^lco • • •C(^ i (^. 8 j- 1 y (t) , 

t=i 

and 

JV 

<lv = E^ ) (o^(o---<lv(o^(si 1 \r 1 ^ T - ( 57 ) 

Finally, for any nonterminal node f 8l ... 8? ,, setting the partial derivative of Q with respect to 
0? ...i equal to zero, we have that #f ..., is the solution of the following nonlinear system 

JV -Rt r --8 r o 

E^W€«'-fvW E [^v +1 (0-^V-..V +1 (x(^^ 1 ...,v)]% ±±J - = 0. (58) 
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As in the case of the one-level ME architecture (cf. Eq. (26) and Eq. (28)), we obtain the 
following Newton step for updating the gating network parameters 



where 



<*X 1} = <*l + In-U^r 1 ^ (59) 



N K n ... tr 

t=i «v+i— i 

x [i g n ... tr (x ,» tl ..., r JJ ^ (fc) (fc) (bUj 



9(Ci)^(Cl: 



and 



fS r = E^o^io-'-CvCOx 



Ji Z1 ••• !r o 



x E[e r+1 w-^--M 1 (x (f) ,<^)is^- (6i) 

As in the case of the one-level architecture (cf. Eq. (28)) the algorithm in Eq. (59) is 
essentially the same as the IRLS algorithm suggested by Jordan and Jacobs (in press), although 
we have introduced a learning rate parameter and we restrict the update to a single step. 

In summary, the EM algorithm for the HME architecture is given as follows. 

Algorithm 3 

1. (The E step): Compute the h\*\t), h\ k J 2 (t), ■ ■ ■ , h\*l. ir (t) by Eqs. (52), (53) and (54). 

2. (The M step): Compute the SJ^ by Eq. (55), compute the 0^ k . + ^ by Eq. (59), and 
compute the 6\ ... 8 - by Eq. (57). 



We can think of the E step as assigning credit (posterior probability) to various branches of the 
tree for each data point and the M step as solving weighted least squares problems in which 
the weights are given by the posterior probabilities assigned in the E step. The updates for the 
gating networks simply cache away the posteriors. 

Theoretical convergence results 

Much of the analysis developed in Section 2.2 can be extended to cover the EM algorithm for 
the HME architecture. In this subsection we extend Theorem 1 and Theorem 2 to cover the 
hierarchical case. The results are given as Theorems 3 and 4, respectively. 
We first compute the derivatives of the Q function 



d09 il-ir t = l 
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* i "" r -• ,dsi 



£ K-v+i« -^,..v +1 (x (t) ,^... v )]^f^, (62) 



V-i-1 =1 ? i '" ?r 



*° -e *!rw eiw ••<!,.('> 



^«'i-«V t= i 



^••«v( x(f) ' g 'i-'v) £ -i r v W_f. • ( x « 0. 



dO n ... t 



(63) 



^- - ^^w^w-e^ux 



9E.V...V 2 t=1 



X {S 8l ... 8V -[yW-f n ... 8V (xW,0 n ... 8V )][yW-f n ... 8V (xW,0 n ... 8V )] T }S^ (64) 
and the derivatives of the log likelihood 



f)l N 

«1— «r t = l 

ft' 

Y^ r^W '- s •" ^ m "" ^^ S 



£ K^+i« -^-v +1 (- (f V fc) ^... 8r )]^F^l0 t ^ r= (,)0 t ^ r , (65) 






^^ I m _^»,(*W(*)m...»,(*) 

/)fl ' -0 W ■ 



t = l 



o fl 10. . _0(*) S n -v[y U - f n-v(x u ,^/.. 8V )], (66) 



'l 1 ---l r "^•••ir "ij-.-ir 



9g , 



1 

^E4* ) (o^(«)---^.,v(')(si*Ur 1 >< 

X {sS:i v -[yW-^.^(x(^0(^J][yW-^... v (xW,^. v )] r }(s(^. v )- 1 (67) 



<9X n ... 8 V ^i-^-^j...^ 2 t=1 



Based on these two sets of derivatives, we follow the same line of thought as in the proof of 
Theorem 1 to establish the following theorem for the HME architecture. 

Theorem 3 For the HME architecture of Eqs. (47) and (48), with the parameter updates given 
by Algorithm 3, we have that for every node Vi 1 ...i r 



,g(k+l) _ n9(k) _ D g(fc) 91 , 

h-i r U H-lr — r il-ir f)f)3 1 09 . -QsW , 

°°i 1 --i r "'l->r-"'l->r 

dl , 



(fc+i) _ e (k) = p (k) 

li - lr de ll ... lr l o tl ^ r =o [ , k 1 W 



vecp^-vecp^] = P^ -^ T L y(k) , (68) 

81 tr H %T ^ii-. r dvec[E n ... lr ] '^h-ir^^-ir V ' 

where I = InL is defined by Eqs. (6), (47) and (48). 
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Moreover, assuming that the training set y is generated by the HME model of Eqs. (47) 
and (48), and that the number N is sufficiently large, then -P/..J ?s a positive definite matrix, 
and -Pj ...j , -?£. . are positive definite matrices with probability one. Specifically, they take 
the following values: 

0) Pi}\ = 7f 1 ... 8V « (fc .l)- 1 ™th E?S% given by Eq. (59). 

(H) P nl r = (R^Ur 1 mth R iiU 9iven by Eq. (57). 

(Hi) For Pa , we have 

PW - 2 y (k) y(k) / fiQ X 

& '""E£ 1 M* ) (0'.!ft«)--'>i* ) ..,i<) '■■'"' "'"■"' 

where "(x) " denotes the Kronecker product as defined in Theorem 3. 

From Theorem 5, we can again reach Corollary 1. Again, we see that the EM algorithm 
for training the HME architecture is a type of variable metric gradient ascent algorithm for 
maximizing / = In L. 

Finally, we can also generalize Theorem 2 as follows. 

Theorem 4 Assume that the training set y is generated by the HME model of Eqs. (47) and 
(48) and that the number N is sufficiently large. Assume that S 8l ... 8? , is diagonal and V£ 8 
is a vector consisting of the diagonal elements of S 8l ... 8? ,. 

Let be a vector produced by cascading every vector v 'i 1 ...i r = [(#f ...,- ) T ,0; ...; ,v|. . ] T of 
every node Vi 1 ...i r in the HME architecture. Let P be a diagonal block matrix with each diagonal 
item being a positive diagonal block matrix D\ ... 8 - = diag[Pf ... 8 - ,Pi 1 ...i r ,P^ i ... ; ]. The items of 
and P are arranged in such a way that D\ ... 8 - in P corresponds to v 8l ... 8? , in . 

Furthermore, assume that on a given domain D@ 

(i) The parameterized functions of all the expert networks and the gating networks have 

second order continuous derivatives. 

(ii) The Hessian matrix H(0) = ^ ' T is negative definite; 

8080 

(Hi) 0* is a local maximum of 1(0), and 0* £ T)q. 

Then we have the same conclusion as given in Theorem 2. That is 

(1) Letting —M, —m ( here M > m > 0) be the minimum and maximum eigenvalues of 
the negatively definite matrix (P^) T H(0)(P^) (or equivalently the minimum and maximum 
eigenvalues of PH(0), since we have PHe = Ae from (P^) T HP^e = Xe), we have 

1(0*) - /(©W) < r k [l(0*) - /(©„)], 



|| P -|( W _ @*)|| < | r |fc/2yi[;( *) _ /(0O )] ) 

where r = 1 - (1 - 4f)ir < 1- We also have < \r\ < 1 when M < 2. 
(2) For any initial point 0q £ Dq, lim^oo 0^ ' = 0* when M < 2. 

We should point out that the similarity in the conclusions of Theorem 2 and Theorem 4 does 
not mean that the EM algorithm for the hierarchical architecture has the same convergence 
rate as that for the one-level architecture. In the two cases the matrix P is different, and thus 
M,m,r are also different. This results in different convergence rates. Indeed, in practice, the 
hierarchical architecture is usually faster. The similarity in the conclusions of the two theorems 
does mean, however, that the convergence rates for both the algorithms are of the same (linear) 
order. 
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Variants of the EM Algorithm and Simulations 

Variants of the EM algorithm 

For convenience, we denote an EM update of the parameter vector as follows 

6>( fc+1 ) = E/p(0(*)). (70) 

From Theorems 1 and 2 and Theorems 3 and 4, we see that this update is actually a line search 
method along an ascent direction of / = In L 

e (k+i) _ e (k) , p (k)dl, (71) 

v -v + r Q d0 \ 0=0 w, (n) 

with Pi ' being a positive definite matrix evaluated at 6^ '. Moreover this update has a linear 
convergence rate. This link between the EM algorithm and conventional gradient-based opti- 
mization techniques suggests the possibility of using acceleration techniques for improving the 
convergence. In the sequel we suggest two such acceleration techniques. 



Modified line search 

Eq. (71) can be replaced by a modified line search 

0(*+D = 0^ + X k d k , 

d k = P e k) ^\ e = (k) = u p (o (k) )-0 (k \ (72) 

where A^ is a stepsize which is optimized by maximizing 1(0^ ' + A^d^) with respect to A^ via 
a one- dimensional search (e.g., Fibonacci search). 

The implementation of a one- dimensional optimization method at every parameter update 
is typically expensive. One often uses an inaccurate line search by decreasing (e.g., A^ — ► rXk) 
or increasing (e.g., A^ —> -Xk) the stepsize heuristically according to a stopping rule. One 
frequently used stopping rule is the so-called Goldstein test (Luenberger, 1984). The Goldstein 
test is implemented as follows 



/(A fc ) 


< Z(0) + eZ'(0)A fc , 


/(A*) 


> Z(0) + (l-e)Z'(0)A fc , 


/'(0) 


,t dl , „( k ) 



:73) 



where < e < 1 is a specified error bound. 
Interestingly, if we rewrite the update as 



fc)N 



0(*+i) = W + \ k [u p (0W) - oW\ = (1 - X k )0^ + \ k u p (o( 

we find that Eq. (72) is identical to the speedup technique for the EM algorithm studied 
by Peters k Walker (1978a,b), Meilijson(1989) and Redner k Walker (1984). These authors 
reported a significant speedup for an appropriately selected A^ even without the Goldstein test. 
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A speeding up formula based on locally linearization 

Using a first-order Taylor expansion of U p (6^ ') around 6^ _1 ', we have, approximately, 

Q(k+i) = e (k) + B (Q(k) _ 0(fc-i)) 

or 

AO k = BAO k - 1 , (74) 

where B = ^\ Q = 0( fc ) and A0 k = 0( fc+1 ) - 0( fc ). 

For the matrix B, we have the following characteristic equation 

n 

det(AJ - B) = X n + ^X 1 - 1 + ■■■ + ii n _ x A + ii n = X n + ]T H ^~ J = 

H = ( _1 ) J H A n A n • • • A *j 0' = !> 2 > • • • ' ra ) 

where A^i = 1, • • -,n are the eigenvalues of B. It follows from the Cayley- Hamilton theorem 
that 

n 

B n + J2^j Bn ~ J = o 

Multiplying by A0 k _ n (k > n), we have 

£ n A0 fc _ n + J2 ^B n -JAO k _ n = 0. (75) 

From Eq. (74), we obtain 

AOk-j = BAOk-j-! =--- = B n ~JAO k _ n , j = 0,l,...,n. 

Substituting into Eq. (75), we have 

n 

AO k + J2vA e k- 3 =Q- (76) 

Assuming that in comparison with the first / eigenvalues, the remaining eigenvalues can be 
neglected, we have approximately 

l / 

X n + J2 /^"~ j = 0, B n + J2 ^ n ~ 3 = 
j=i j=i 

and correspondingly 

^ + ^M J A0 8 _ J = O, i = k,k+l,---. (77) 

j=i 
The approximation becomes exact when the last n — I eigenvalues are zero. 

By minimizing \\6 k + ^7=1 Mj^^fc-jH 2 , we obtain the following linear equation for solving 
li = [^i,---,^/] T 

S fi = s , S = [sij]i x i, s = [-S01, -S02, ■ ■ •, -soi] , 
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{A9 k _ t ) T (A9 k _ J ), (78) 

Moreover, we have 

CO / / CO / CO 

i=k j=0 j=0 i=k+l j=l i=k+l 



where fj,Q = 1. From Eq. (77) and 



i=k+l 



(where 9* = lim^oo A9 k ), we have 



l 
-A0 k+1 + 9* + ]>>,(- A0 fc+1 _, + 0*) 



we finally obtain 



Using this equation together with 

l l l-i l 

E^jAflfc+i-j = A9 k+1 Y,»j ~ Yli Y, V]] Ae k-t 

3=0 3=0 i=0 3=1+1 

V k +i = Vk+i =j • (79) 

1^3=0 ^3 

This formula can be used for speeding up the EM algorithm in two ways. 

• Given an initial 9q, we compute via the EM update 9o, 9\ , • • • , 0/+i, and then from 9j,j = 
1, • • • , /, we solve /ii, • • • , /i; utilizing Eq. (78). This yields a new 9* +1 via Eq. (79). We 
then let 9q = 9* +1 and repeat the cycle. 

• Instead of starting a new cycle after obtaining 9* k+1 , we simply let 0/ + i = 9* +1 , and use 
the EM update (Eq. 70) to obtain a new 0/+2, then we use 9\,9\, ■ ■ -,#/+2 to get a #* +2 - 
Similarly, after getting 9* k ,k > /, we let 9 k = 9\ and use the EM update to get a new 
9 k+1 , and then use k -i, • • • , 9 k+1 to get a 9* k+1 . 



Specifically, when / = 1, we have 



A9 k (A9 k ) 1 (A9 k _ 1 \ 



dl+1 - dk + TT7i' ^ ~ ~ (A9 k _ 1 nA9l 1 y (80) 

In this case, the extra computation required by the acceleration technique is quite small, and 
we recommend the use of this approach in practice. 

Simulations 

We conducted two sets of computer simulations to compare the performance of the EM al- 
gorithm with the two variants described in the previous section. The training data for 
each simulation consisted of 1000 data points generated from the piecewise linear function 

y = a\x + <22 + n t , x £ [^L^fy] an d y = a[x + a' 2 + n t , x £ \x' l ,x'jj\, where n t is a gaussian 
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random variable with zero mean and variance a = 0.3. Training data were sampled from the 
first function with probability 0.4 and from the second function with probability 0.6. 

We studied a modular architecture with K = 2 expert networks. The experts were linear; 
that is, fj(x.( t \6j) were linear functions [x, 1] Oj. For the gating net, we have 

Sj = [X, 1J Ogj, 

where gj is given by Eq. (2). For simplicity, we updated g j by gradient ascent 

e (k+i) _ Q (k) dQ_ (m 

Uf7 gj 

The learning rate parameter was set to r g = 0.05 for the first data set and r g = 0.002 for the 

second data set. We used Eq. (20) and Eq. (22) to update the parameters 6- , j = 1,2 and 

a- , j = 1,2, respectively. 

The initial values of 6 q -,6- , a- , j = 1,2 were picked randomly. To compare the perfor- 
mance of the algorithms, we let each algorithm start from the same set of initial values. 
The first data set (see Figure 3(a)) was generated using the following parameter values 

al = 0.8, a2 = 0.4, x L = -1.0, x v = 1.0, a[ = -1.0, a' 2 = 3.6, x' L = 2.0, x' v = 4.0. 

The performance of the original algorithm, the modified line search variant with A^ = 1.1, the 
modified line search variant with A^ = 0.5, and the algorithm based on local linearization are 
shown in Figures 4, 5, 6, and 7, respectively. As seen in Figures 4(a) and 5(a), the log likelihood 
converged after 19 steps using both the original algorithm and the modified line search variant 
with Afc = 1.1. When a smaller value was used (A^ = 0.5), the algorithm converged after 24 steps 
(Figure 6(a)). Trying other values of A^, we verified that A^ < 1 slows down the convergence, 
while Afc > 1 may speed up the convergence (cf. Redner & Walker, 1984). We found, however, 
that the outcome was quite sensitive to the selection of the value of A^. For example, setting 
\k = 1.2 led the algorithm to diverge. Allowing A^ to be determined by the Goldstein test 
(Eq. 73) yielded results similar to the original algorithm, but required more computer time. 
Finally, Figure 7(a) shows that the algorithm based on local linearization yielded substantially 
improved convergence — the log likelihood converged after only 8 steps. 

Figures 4(b) and 4(c) show the evolution of the parameters for the first expert net and 
the second expert net, respectively. Comparison of these figures to Figures 5(b) and 5(c) 
shows that the original algorithm and the modified line search variant with A^ = 1.1 behaved 
almost identically: 9\ converged to the correct solution after about 18 steps in either case. 
Figures 6(b) and 6(c) show the slowdown obtained by using A^ = 0.5. Figures 7(b) and 7(c) 
show the improved performance obtained using the local linearization algorithm. In this case, 
the weight vectors converged to the correct values within 7 steps. 

Panel (d) in each of the figures show the evolution of the estimated variances o\,o\. The 
results were similar to those for the expert net parameters. Again, the algorithm based on local 
linearization yielded significantly faster convergence than the other algorithms. 

A second simulation was run using the following parameter values (see Figure 3(b)) 

al = 0.8, a2 = 0.4, x L = -1.0, x v = 2.0, a[ = -1.2, a' 2 = 2.4, x' L = 1.0, x' v = 4.0. 

The results obtained in this simulation were similar to those obtained in the first simulation. 
The EM algorithm converged in 11 steps and the local linearization algorithm converged in 6 
steps. 

24 



&?r. 






-1 -0.5 0.5 1 



1.5 



(a) 






.i 



„»&»° 



°«* 



„ °o « 




Figure 3: The data sets for the simulation experiments. 
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Figure 4: The performance of the original EM algorithm: (a) The evolution of the log likelihood; 
(b) the evolution of the parameters for expert network 1; (c) the evolution of the parameters 
for expert network 2; (d) the evolution of the variances. 
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Figure 5: The performance of the linesearch variant with A^ = 1.1. 
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Figure 6: The performance of the linesearch variant with A^ = 0.5. 
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Figure 7: The performance of the algorithm based on local linearization. 
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The results from a number of other simulation experiments confirmed the results reported 
here. In general the algorithm based on local linearization provided significantly faster con- 
vergence than the original EM algorithm. The modified line search variant did not appear to 
converge faster (if the parameter A^ was fixed). We also tested gradient ascent in these exper- 
iments and found that convergence was generally one to two orders of magnitude slower than 
convergence of the EM algorithm and its variants. Moreover, convergence of gradient ascent 
was rather sensitive to the learning rate and the initial values of the parameters. 

Concluding remarks 

Finite mixture models have become increasingly popular as models for unsupervised learning, 
partly because they occupy an interesting niche between parametric and nonparametric ap- 
proaches to statistical estimation. Mixture-based approaches are parametric in that particular 
parametric forms must be chosen for the component densities, but they can also be regarded as 
nonparametric by allowing the number of components of the mixture to grow. The advantage of 
this niche in statistical theory is that these models have much of the flexibility of nonparametric 
approaches, but retain some of the analytical advantages of parametric approaches (McLachlan 
& Basford, 1988). Similar remarks can be made in the case of supervised learning: The ME 
architecture and the HME architecture provide flexible models for general nonlinear regres- 
sion while retaining a strong flavor of parametric statistics. The latter model, in particular, 
compares favorably to decision tree models in this regard (Jordan & Jacobs, in press). 

In the current paper we have contributed to the theory of mixture-based supervised learning. 
We have analyzed an EM algorithm for ME and HME architectures and provided theorems on 
the convergence of this algorithm. In particular, we have shown that learning algorithm can be 
regarded as a variable metric algorithm with its metric matrix P being positive definite, so that 
the searching direction of the algorithm always has a positive projection on the gradient of the 
log likelihood. We have shown that the algorithm converges linearly, with a rate determined 
by the difference between the minimal and maximal eigenvalues of a negative definite matrix. 

Similar results to those obtained here can also be obtained for the case of the unsupervised 
learning of finite mixtures (Xu & Jordan, 1993). 

References 

Baum, L.E., & Sell, G.R. (1968). Growth transformation for functions on manifolds. Pac. J. 
Math., 27, 211-227. 

Baum, L.E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in 
the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical 
Statistics, ^1, 164-171. 

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and Regression 
Trees. Belmont, CA: Wadsworth International Group. 

De Veaux, R.D. (1986). Parameter estimation for a mixture of linear regressions. Ph.D. 
Dissertation and Tech. Rept. No. 247, Department of Statistics, Stanford University, Stanford, 
CA. 

Dempster, A. P., Laird, N.M., & Rubin, D.B. (1977). Maximum-likelihood from incomplete 

30 



data via the EM algorithm. J. of Royal Statistical Society, B39, 1-38. 

Friedman, J.H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 
1-141. 

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive mixtures of local 
experts. Neural Computation, 3, 79-87. 

Jordan, M.I. & Jacobs, R.A. (1992). Hierarchies of adaptive experts. In J.E. Moody, S. Hanson 
& R.P. Lippmann, (Eds.), Advances in Neural Information Processing System J h San Mateo: 
Morgan Kaufmann, 985-992. 

Jordan, M.I. & Jacobs, R.A. (in press). Hierarchical mixtures of experts and the EM algorithm. 
Neural Computation. 

Little, R.J. A., & Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: John 
Wiley. 

McCullagh, P. & Nelder, J. A. (1983). Generalized Linear Models. London: Chapman and Hall. 

McLachlan, G.J., & Basford, K.E. (1988). Mixture Models: Inference and Application to Clus- 
tering. New York: Dekker. 

Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. J. of Royal 
Statistical Society, B51, 127-138. 

Peters, B.C. & Walker, H.F. (1978a). An iterative procedure for obtaining maximum-likelihood 
estimates of the parameters for a mixture of normal distributions. SIAM J. Applied Mathemat- 
ics, 35, 362-378. 

Peters, B.C., & Walker, H.F. (1978b). The numerical evaluation of the maximum-likelihood 
estimates of a subset of mixture proportions, SIAM J. Applied Mathematics, 35, 447-452. 

Quandt, R.E. & Ramsey, J.B. (1972). A new approach to estimating switching regressions. J. 
American Statistical Society, 61, 306-310. 

Quandt, R.E. & Ramsey, J.B. (1978). Estimating mixtures of normal distribution and switching 
regressions. J. American Statistical Society, 13, 730-738. 

Redner, R.A., & Walker, H.F. (1984). Mixture densities, maximum likelihood, and the EM 
algorithm. SIAM Review 26, 195-239. 

Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. The Annals of Statis- 
tics, 11, 95-103. 

Xu, L., & Jordan, M.I. (1993). Theoretical and experimental studies of the EM algorithm 
for unsupervised learning based on finite Gaussian mixtures. MIT Computational Cognitive 
Science Tech. Rep. 9302, MIT, Cambridge, MA. 

Appendix 

The theoretical results presented in the main text show that the EM algorithm for the ME 
and HME architectures converges linearly with a rate determined by the condition number of a 
particular matrix. These results were obtained for a special case in which the expert networks 
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are linear with a gaussian probability model and the gating networks are multinomial logit 
models. In this section we discuss extensions of these results to other architectures. 

We first note that Theorems 2 and 4 make no specific reference to the particular probability 
models utilized in specifying the architecture. The results on convergence rate in these theorems 
require only that the matrix P be positive definite. These theorems apply directly to other 
architectures if the corresponding P matrices can be shown to be positive definite. We therefore 
need only consider generalizations of Theorem 1, the theorem which established the positive 
definiteness of P for the generalized linear ME architectures. An analogous generalization of 
Theorem 3 for the HME architectures can also be obtained. 

Let us consider the case in which the function implemented by each expert network 
(fj(x, 0j)) is nonlinear in the parameters. We consider two possible updates for the param- 
eters: (1) a gradient algorithm 

0f +1) =0f + 7 jef, (82) 



where 



t=i 
and (2) a Newton algorithm: 



*f = XfflO g g (*) W )"V f) - f,(x« 0f)], (83) 



^+i) = ^) +7 . (iE (*) ) -i e (*) ) (84) 



where 



N 



^ = Eh?)(J) W^> (S? .,-.£W^, (88) 

3 ^i J oof J d(of ] ) T 

where jj > is a learning rate. 

These updates are covered by the following extension of Theorem 1. 

Theorem 1A 1 For the model given by Eq. (4) and the updates given by Eq. (83) or Eq. 
(85), we have: 

e {k+i) _ Q {k) _ p {k)dl_, ? -i ... K 

where P- ' is positive definite. 



(k) 

Proof. For the gradient descent algorithm, we have P- ' = JjIk, which is obviously positive 

definite because jj > 0. For the Newton algorithm, we have that P- ' = ~/j(Rj ) _1 - We now 
show that this matrix is positive definite. For an arbitrary vector u, we have 

3 ti J dof J d(ef ] ) T 

N 

t=l 

> o. 
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T 9f T (x( t ),0 ( / :) ) . (k) . . . 

Equality holds only when v = u — - — , k < — = 0, since Yj- is positive definite with probability 

3 

one. This is impossible for any u. So with probability one, R- ' (and thus (R- ')~ 1 also) is 
positive definite. □ 

Note that the Newton update (Eq. 85) is particularly appropriate for the case in which the 
experts are generalized linear models (McCullagh & Nelder, 1983); that is, the case in which 
fj(xW, 0j) = [/ji(xW, Oj), ■ ■ ■ , /jd a (x(*), 0j)] (d y is the dimension of y) with 

/ji(x (f) ,flj) = F Jt ([0 jA , ■ ■ • , h m ] T x« + e jim+1 ), 

where Fji(.) is a continuous univariate nonlinear function known as the link function. In this 
case the Newton algorithm reduces to the IRLS algorithm. The extension to generalized linear 
models also allows probability models from the generalized exponential family (cf. Jordan & 
Jacobs, in press) and Theorem 1A is applicable to this case as well. 

We can also consider the case in which the gating network is nonlinear in the parameters. 
Both the Newton update (IRLS update) and the gradient update are applicable in this case. 
Theorem 1 already established that the Newton update for the gating network involves a positive 
definite P g matrix. As in Theorem 1A, the result for the gradient update is immediate. 
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